Model Validation:

My model so far has achieved a Validation R^2 of 0.8596 and a Validation Adjusted R^2 of 0.8327.

On the training data, the original R^2 was 0.9134 and the original adjusted R^2 was 0.9066.

Code and Table of Results:

library(readr)
library(ggplot2)
library(pander)
library(tidyverse)
library(plotly)
library(reshape2)

train <- read.csv("../Data/train.csv", stringsAsFactors = TRUE)

train <- train %>%
  mutate(TotalSF = X1stFlrSF + X2ndFlrSF + TotalBsmtSF,
         RichNbrhd = case_when(Neighborhood %in% c("StoneBr", "NridgHt", "NoRidge") ~ 1, TRUE ~ 0),
         Alley = replace_na(as.character(Alley), "None"),
         Alley = as.factor(Alley))

set.seed(121)
num_rows <- 1000
keep <- sample(1:nrow(train), num_rows)

mytrain <- train[keep, ]
mytest <- train[-keep, ]

lm_model <- lm(SalePrice ~ TotalSF + RichNbrhd + YearBuilt + WoodDeckSF + FullBath + BsmtQual + Neighborhood + HouseStyle + OverallQual + OverallCond + BsmtCond + TotalBsmtSF +
                 TotalSF:RichNbrhd + TotalSF:Fireplaces + TotalSF:Neighborhood + TotalSF:OverallCond + TotalSF:TotalBsmtSF, data=mytrain)
pander(summary(lm_model))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -1111552 176639 -6.293 4.868e-10
TotalSF 80.41 39.38 2.042 0.04147
RichNbrhd -121784 113843 -1.07 0.285
YearBuilt 512.8 69.78 7.348 4.527e-13
WoodDeckSF 16.41 6.951 2.36 0.01847
FullBath 1206 2543 0.474 0.6356
BsmtQualFa -33407 7338 -4.553 6.03e-06
BsmtQualGd -36587 3914 -9.347 6.933e-20
BsmtQualTA -35416 4833 -7.328 5.194e-13
NeighborhoodBlueste 21566 48495 0.4447 0.6566
NeighborhoodBrDale 86632 120906 0.7165 0.4739
NeighborhoodBrkSide 119843 111929 1.071 0.2846
NeighborhoodClearCr 63954 118041 0.5418 0.5881
NeighborhoodCollgCr 86155 110955 0.7765 0.4377
NeighborhoodCrawfor 136357 111803 1.22 0.2229
NeighborhoodEdwards 167508 110824 1.511 0.131
NeighborhoodGilbert 52224 112179 0.4655 0.6417
NeighborhoodIDOTRR 94784 115410 0.8213 0.4117
NeighborhoodMeadowV 134033 112500 1.191 0.2338
NeighborhoodMitchel 117917 111938 1.053 0.2924
NeighborhoodNAmes 136988 110718 1.237 0.2163
NeighborhoodNoRidge 56268 35251 1.596 0.1108
NeighborhoodNPkVill 166470 172379 0.9657 0.3344
NeighborhoodNridgHt 20768 34308 0.6053 0.5451
NeighborhoodNWAmes 99850 111689 0.894 0.3716
NeighborhoodOldTown 131950 110993 1.189 0.2348
NeighborhoodSawyer 140505 111575 1.259 0.2083
NeighborhoodSawyerW 101359 111589 0.9083 0.3639
NeighborhoodSomerst 70926 111416 0.6366 0.5246
NeighborhoodSWISU 160927 112443 1.431 0.1527
NeighborhoodTimber 36912 112948 0.3268 0.7439
NeighborhoodVeenker -115610 126479 -0.9141 0.3609
HouseStyle1.5Unf 3455 8500 0.4064 0.6845
HouseStyle1Story 9225 3895 2.369 0.01807
HouseStyle2.5Fin -582 13263 -0.04388 0.965
HouseStyle2.5Unf -121.3 11037 -0.01099 0.9912
HouseStyle2Story -219.6 3566 -0.06159 0.9509
HouseStyleSFoyer 12212 6876 1.776 0.07609
HouseStyleSLvl 15405 5562 2.77 0.005724
OverallQual 10408 1152 9.033 9.987e-19
OverallCond -1166 2703 -0.4313 0.6663
BsmtCondGd 2819 6253 0.4508 0.6522
BsmtCondPo 3826 19550 0.1957 0.8449
BsmtCondTA 3185 5051 0.6306 0.5284
TotalBsmtSF -15.57 10.31 -1.511 0.1311
TotalSF:RichNbrhd 48.05 39.58 1.214 0.2251
TotalSF:Fireplaces 3.728 0.5821 6.403 2.446e-10
TotalSF:NeighborhoodBrDale -34.98 47.49 -0.7365 0.4616
TotalSF:NeighborhoodBrkSide -37.8 39.79 -0.95 0.3424
TotalSF:NeighborhoodClearCr -15.75 41.12 -0.383 0.7018
TotalSF:NeighborhoodCollgCr -25.43 38.95 -0.653 0.5139
TotalSF:NeighborhoodCrawfor -39.72 39.18 -1.014 0.311
TotalSF:NeighborhoodEdwards -64.66 38.92 -1.661 0.09699
TotalSF:NeighborhoodGilbert -14.13 39.51 -0.3577 0.7206
TotalSF:NeighborhoodIDOTRR -29.52 42.58 -0.6934 0.4882
TotalSF:NeighborhoodMeadowV -63.43 40.57 -1.563 0.1183
TotalSF:NeighborhoodMitchel -45.37 39.4 -1.152 0.2498
TotalSF:NeighborhoodNAmes -50.89 38.86 -1.309 0.1907
TotalSF:NeighborhoodNoRidge -25.85 9.747 -2.652 0.008144
TotalSF:NeighborhoodNPkVill -69.39 71.88 -0.9654 0.3346
TotalSF:NeighborhoodNridgHt -10.28 9.931 -1.035 0.3011
TotalSF:NeighborhoodNWAmes -37.86 39.17 -0.9668 0.3339
TotalSF:NeighborhoodOldTown -49.08 39.04 -1.257 0.209
TotalSF:NeighborhoodSawyer -53.62 39.38 -1.362 0.1736
TotalSF:NeighborhoodSawyerW -32.78 39.16 -0.837 0.4028
TotalSF:NeighborhoodSomerst -15.67 39.12 -0.4005 0.6889
TotalSF:NeighborhoodSWISU -58.88 39.67 -1.484 0.138
TotalSF:NeighborhoodTimber -9.162 39.46 -0.2322 0.8164
TotalSF:NeighborhoodVeenker 51.96 44.09 1.178 0.239
TotalSF:OverallCond 3.674 1.09 3.37 0.0007826
TotalSF:TotalBsmtSF -0.008073 0.002233 -3.615 0.0003171
Fitting linear model: SalePrice ~ TotalSF + RichNbrhd + YearBuilt + WoodDeckSF + FullBath + BsmtQual + Neighborhood + HouseStyle + OverallQual + OverallCond + BsmtCond + TotalBsmtSF + TotalSF:RichNbrhd + TotalSF:Fireplaces + TotalSF:Neighborhood + TotalSF:OverallCond + TotalSF:TotalBsmtSF
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 969 | 24841 | 0.9134 | 0.9066 |
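# Predict on the held-out rows
predicted <- predict(lm_model, newdata=mytest)

# Rows with missing predictor values give NA predictions; fill them with the mean prediction
predicted <- ifelse(is.na(predicted), mean(predicted, na.rm = TRUE), predicted)

ybar <- mean(mytest$SalePrice)

SSTO <- sum((mytest$SalePrice - ybar)^2)      # total sum of squares on the test set
SSE <- sum((mytest$SalePrice - predicted)^2)  # sum of squared prediction errors

r_squared <- 1 - SSE / SSTO
n <- nrow(mytest)
p <- length(coef(lm_model))  # note: counts the intercept and any NA (aliased) coefficients
adj_r_squared <- 1 - ((n - 1) / (n - p - 1)) * (SSE / SSTO)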

validation_results <- data.frame(
  Model = "My Model",
  `Original R^2` = summary(lm_model)$r.squared,
  `Original Adj. R^2` = summary(lm_model)$adj.r.squared,
  `Validation R^2` = r_squared,
  `Validation Adj. R^2` = adj_r_squared,
  check.names = FALSE  # keep the column names as written instead of Original.R.2 etc.
)

knitr::kable(validation_results, digits = 4)

| Model | Original R^2 | Original Adj. R^2 | Validation R^2 | Validation Adj. R^2 |
|---|---|---|---|---|
| My Model | 0.9134 | 0.9066 | 0.8596 | 0.8327 |

Visualizing the model:

SalePrice vs TotalSF:

ggplot(mytrain, aes(x = TotalSF, y = SalePrice)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "SalePrice vs TotalSF", x = "Total Square Feet", y = "Sale Price") +
  theme_minimal()

This scatter plot, titled SalePrice vs TotalSF, shows the relationship between a house’s total square footage and its sale price. Each grey dot corresponds to one house. There’s a clear upward trend, indicating a positive correlation between the two variables: as square footage increases, sale price generally increases as well. The red line is a simple linear regression fit (geom_smooth with method = "lm"). While the trend is evident, the scatter of the points shows that other factors influence sale price beyond just square footage.

My 3D Scatterplot of SalePrice vs TotalSF & OverallQual

library(plotly)
plot_ly(mytrain, x = ~TotalSF, y = ~OverallQual, z = ~SalePrice, type = "scatter3d", mode = "markers",
        marker = list(size = 3, color = ~SalePrice, colorscale = "Viridis")) %>%
  layout(title = "3D Scatter: SalePrice vs TotalSF & OverallQual",
         scene = list(xaxis = list(title = "TotalSF"),
                      yaxis = list(title = "OverallQual"),
                      zaxis = list(title = "SalePrice")))

This 3D scatter plot illustrates the relationship between Sale Price, Total Square Feet (TotalSF), and Overall Quality (OverallQual) of houses, revealing a strong positive correlation across all three variables. As both TotalSF and OverallQual increase, SalePrice tends to rise, indicating that larger, higher-quality homes command higher prices. The data points cluster in the lower ranges, but distinct outliers, especially those with high SalePrice and OverallQual, suggest premium properties.

Another 3D Scatterplot of SalePrice vs RichNbrhd & YearBuilt

library(plotly)
plot_ly(mytrain, x = ~RichNbrhd, y = ~YearBuilt, z = ~SalePrice, type = "scatter3d", mode = "markers",
        marker = list(size = 3, color = ~SalePrice, colorscale = "Viridis")) %>%
  layout(title = "3D Scatter: SalePrice vs RichNbrhd & YearBuilt",
         scene = list(xaxis = list(title = "RichNbrhd"),
                      yaxis = list(title = "YearBuilt"),
                      zaxis = list(title = "SalePrice")))

This 3D scatter plot illustrates the relationship between SalePrice, YearBuilt, and RichNbrhd. RichNbrhd is a 0/1 indicator, so the points fall into two groups. The plot suggests a potential trend where newer houses (higher YearBuilt) in the rich neighborhoods (RichNbrhd = 1) tend to have higher SalePrices. Most points sit at RichNbrhd = 0, indicating that most properties are not in the “rich” neighborhoods. However, there is a noticeable spread of SalePrices across different YearBuilt values, with some newer homes showing significantly higher prices.

One final 3D Scatterplot showing SalePrice vs OverallCond & FullBath

library(plotly)
plot_ly(mytrain, x = ~OverallCond, y = ~FullBath, z = ~SalePrice, type = "scatter3d", mode = "markers",
        marker = list(size = 3, color = ~SalePrice, colorscale = "Viridis")) %>%
  layout(title = "3D Scatter: SalePrice vs OverallCond & FullBath",
         scene = list(xaxis = list(title = "OverallCond"),
                      yaxis = list(title = "FullBath"),
                      zaxis = list(title = "SalePrice")))

This 3D scatter plot visualizes SalePrice against OverallCond and FullBath, revealing a potential positive correlation between SalePrice and OverallCond. As OverallCond increases, SalePrice tends to rise, indicating that homes in better condition command higher prices. Data points cluster in the lower OverallCond ranges, with outliers at high SalePrice and OverallCond suggesting premium properties.

Diagnostic Plots:

par(mfrow=c(2,3))
plot(lm_model, which = c(1,2,4,5, 6))
plot(lm_model$residuals)

Residuals vs Fitted: The dots should be spread out randomly. If they’re not, it means our predictions might be off, especially for certain values. Some points, like 8080 and 1325, are far away.

Q-Q Residuals: This checks if the errors are normal. If the dots follow the line, it’s good. Some dots are a bit off, especially 8080 and 1325.

Cook’s Distance: This shows if any single point has a big influence. Point 524 has a very big influence.

Residuals vs Leverage: This finds points that are both far away from the others and have big errors. Points 524, 497, and 1183 are like this.

Cook’s dist vs Leverage: Another way to see those influential points. Point 524 is still the biggest problem.

Residuals over Index: This checks if the errors are random over the data. They look random, but again, 8080 and 1325 are way off.
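The influential points picked out by eye in the plots above can also be listed numerically. Below is a minimal sketch on the built-in mtcars data rather than the housing model; the 4/n cutoff for Cook's distance is a common rule of thumb and an assumption here, not something the plots themselves define.

```r
# Toy model on a built-in dataset; swap in lm_model to inspect the housing fit.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)
diag_tbl <- data.frame(
  obs          = names(cooks.distance(fit)),  # observation labels (row names)
  cooks_d      = cooks.distance(fit),         # influence of each point on the fit
  leverage     = hatvalues(fit),              # how unusual the predictor values are
  std_residual = rstandard(fit),              # standardized residuals
  row.names    = NULL
)

# Rule-of-thumb cutoff (an assumption, not a formal test): Cook's D > 4 / n
cutoff  <- 4 / n
flagged <- diag_tbl[diag_tbl$cooks_d > cutoff, ]
flagged <- flagged[order(-flagged$cooks_d), ]  # most influential first
print(flagged)
```

Refitting the model without the flagged rows and comparing coefficients is one way to judge whether a point such as 524 actually changes the conclusions.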

Discussion of Validation results and Coefficients:

Validation Results:

Original Adjusted R² (0.9066): Indicates the model explains 90.66% of the variance in house prices, adjusting for predictors.

Validation Adjusted R² (0.8327): Shows how well the model generalizes to new data. A decrease from the original suggests slight overfitting, but 0.8327 still indicates strong model performance on unseen data.

Validation R² (0.8596) measures the proportion of variance in the validation data explained by the model. A high value indicates good predictive performance, though it doesn’t adjust for the number of predictors, so it’s best considered alongside adjusted R².
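The adjusted R² penalty depends on how the predictors are counted. The validation code above sets p with length(coef(lm_model)), which includes the intercept and any NA coefficients, while the usual formula counts predictors without the intercept. The sketch below is a hypothetical helper (validation_r2 and its argument k are names introduced here for illustration) applied to a toy split of mtcars, not the housing data.

```r
# Hypothetical helper: out-of-sample R^2 and adjusted R^2.
# k = number of predictors, NOT counting the intercept.
validation_r2 <- function(actual, predicted, k) {
  n    <- length(actual)
  sse  <- sum((actual - predicted)^2)     # squared prediction errors
  ssto <- sum((actual - mean(actual))^2)  # total variation in the test set
  r2   <- 1 - sse / ssto
  adj  <- 1 - (n - 1) / (n - k - 1) * (1 - r2)
  c(r_squared = r2, adj_r_squared = adj)
}

# Toy illustration on a built-in dataset with a reproducible split
set.seed(1)
idx  <- sample(nrow(mtcars), 22)
fit  <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
test <- mtcars[-idx, ]

# Count estimable slopes only: drop the intercept and any NA (aliased) coefficients
k <- sum(!is.na(coef(fit))) - 1
validation_r2(test$mpg, predict(fit, newdata = test), k)
```

Counting fewer parameters makes the penalty slightly smaller, so the two conventions give slightly different adjusted values from the same predictions.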

Coefficients Interpretation:

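TotalSF: Indicates how much SalePrice changes per additional square foot of total space. Because TotalSF also appears in several interaction terms, the main-effect estimate (80.41) is the slope only when the interacting variables are at zero or at their reference level.

RichNbrhd: Shows the price difference for houses in wealthy neighborhoods compared to others. Its estimate is not significant here (p = 0.285), and it has to be read together with the Neighborhood terms and the TotalSF:RichNbrhd interaction.

YearBuilt: Reflects how much higher the sale price is for each year later that the house was built. The estimate of 512.8 means a house built one year later is predicted to sell for about 513 more, holding the other predictors fixed.

OverallQual: Each one-point increase in the overall quality rating is associated with a sale price about 10,408 higher, holding the other predictors fixed. OverallCond, the condition rating, matters mainly through its interaction with TotalSF: its main effect is not significant (p = 0.6663), but TotalSF:OverallCond is (p = 0.0007826).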

Interaction terms: For example, TotalSF:RichNbrhd shows how the effect of square footage on price changes in wealthy neighborhoods.
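To make the interaction reading concrete, here is a minimal sketch on simulated data. The variable names sqft and rich and all of the numbers are made up for illustration and are not the housing estimates. With an interaction, the slope for square footage in the indicator group is the main effect plus the interaction coefficient.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(42)
n     <- 200
sqft  <- runif(n, 800, 4000)
rich  <- rbinom(n, 1, 0.3)
price <- 20000 + 50 * sqft + 10000 * rich + 30 * sqft * rich + rnorm(n, sd = 5000)

fit <- lm(price ~ sqft + rich + sqft:rich)
b   <- coef(fit)

# Price change per extra square foot, by group
slope_other <- b[["sqft"]]                     # rich == 0: main effect only
slope_rich  <- b[["sqft"]] + b[["sqft:rich"]]  # rich == 1: main effect + interaction
c(other = slope_other, rich = slope_rich)
```

In the fitted housing model, TotalSF also interacts with Neighborhood, Fireplaces, OverallCond, and TotalBsmtSF, so the full per-square-foot slope for a given house adds those terms as well.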

There are other coefficients in the model, but I felt these were the most impactful and led to me achieving the highest R-squared value.